Breast cancer is the most common type of cancer and occurs mostly in 
females. In recent years, many researchers have devoted effort to automating 
the diagnosis of breast cancer by developing different machine learning models. 
However, the quality and quantity of the features in a breast cancer diagnostic 
dataset have a significant effect on the accuracy and efficiency of a predictive 
model. Feature selection is an effective method for reducing dimensionality 
and improving the accuracy of a predictive model. Feature selection 
determines the features required for training a model and removes irrelevant 
and duplicate features. A duplicate feature is a feature that is highly correlated 
with another feature. The objective of this study is to conduct experimental 
research on three different feature selection methods for breast cancer 
prediction. Sequential, embedded and chi-square feature selection are 
implemented using a breast cancer diagnostic dataset. The study compares the 
performance of sequential, embedded and chi-square feature selection on the test 
set. The experimental results show that sequential feature selection 
outperforms chi-square (χ²) statistics and embedded feature 
selection. Overall, sequential feature selection achieves the best accuracy, 
98.3%, compared with chi-square (χ²) statistics and embedded feature 
selection. 

1. INTRODUCTION 


Breast cancer is among the most common causes of death among women worldwide [1], [2], and it 
has been reported as the second most prevalent cause of death in women [3]. Thus, early 
prediction of breast cancer is vital to reduce the mortality it causes. Despite advances in 
mammography screening systems for early prediction, breast cancer prediction remains complicated by 
the difficulty of interpreting X-rays, the limited number of experienced oncologists in developing 
nations such as Ethiopia, and the high variability of experts' knowledge. The decision 
making process during breast cancer prediction needs high accuracy because the outcome is highly risky: 
false positives lead to anxiety, while false negatives lead to complications and patient suffering due to 
lack of treatment. 

Redundant and duplicate input features in the Wisconsin breast cancer diagnostic dataset increase 
the computational time required for training and testing a predictive model for breast cancer detection [4]. 
Furthermore, a very large number of input features increases the volume of the dataset, and a larger dataset 
requires more storage space. High correlation between features also affects the performance of a model on 
breast cancer prediction [5]. 

The objective of feature selection is to extract representative features for describing each 
observation in the dataset [6], [7]. Feature selection therefore reduces the number of input features 
required to describe an observation and, consequently, to train a model. Reducing the number of input 
features decreases the computational time for training and testing a model [8]. Hence, feature selection 
helps in developing more effective and faster models. Researchers have proposed different types of 
feature selection methods, which are classified into three groups [9], [10]: filter, embedded and wrapper 
methods. The performance of feature selection methods varies across different datasets. The objective of 
this research is to investigate the performance of sequential, embedded and chi-square feature selection 
for breast cancer prediction. Overall, this research aims to answer the following research questions: 

— What is the performance of sequential feature selection for breast cancer prediction? 

— What is the performance of embedded feature selection for breast cancer prediction? 

— What is the performance of chi-square for breast cancer prediction? 

— How can the performance of a random forest model for breast cancer prediction be improved? 

The rest of this paper is organized as follows: section 2 presents the state of the art, section 3 
discusses the methodology, section 4 presents and discusses the comparative results, and section 5 
concludes the research. 


2. LITERATURE SURVEY 

Several studies have addressed the problem of breast cancer prediction using automated machine 
learning models, and different automated models for breast cancer prediction have been proposed. 
However, breast cancer prediction still needs to be studied, as the performance of the existing 
models leaves considerable scope for improvement. Zhang et al. [11] developed a decision tree based 
model for breast cancer prediction, which achieved an accuracy of 97.48%. Their experiments on various 
feature subsets show that feature selection is important for obtaining good results on breast cancer 
prediction with a decision tree model. 

Feature selection is important for enhancing the predictive accuracy of a model for breast cancer 
prediction. Assegie et al. [12] suggested that the performance of decision tree and adaptive boosting 
models greatly improves when the model is trained on optimal input features. 
Moreover, optimal feature selection helps to gain insight into a dataset and to discover important features 
in a breast cancer dataset. The experimental result reveals that the accuracy of the developed model is 92.53%. 

Automated predictive models have proved important for predicting breast cancer at an early stage, 
which increases the survival rate of breast cancer patients. An automated model is more accurate than 
inexperienced human experts or oncologists for breast cancer prediction [13]. The authors of [13] developed a 
convolutional neural network based predictive model for breast cancer detection with a predictive accuracy 
of 97%. In addition to accuracy, an automated breast cancer prediction model avoids human errors and reduces 
the time and cost incurred for breast cancer identification [14]. Moreover, an automated breast cancer 
prediction model avoids extra load on oncologists, especially where the number of breast cancer patients is high. 

In recent years, automated intelligent breast cancer prediction systems have been implemented with 
different supervised learning algorithms, such as k-nearest neighbor (KNN) and artificial neural network 
(ANN) [15]. However, the performance of the developed models still has scope for improvement. Thus, we are 
motivated to study the existing work and propose a more accurate model for 
breast cancer prediction by employing different feature selection methods, namely chi-square, sequential and 
embedded feature selection. 


3. RESEARCH METHODOLOGY 

The dataset for this study is the Wisconsin breast cancer diagnostic dataset, obtained 
from the Kaggle data repository. To evaluate the developed model, we employed accuracy (the proportion of 
correct predictions) with 5-fold cross validation. Chi-square, sequential feature selection and model based 
(embedded) feature selection using a random forest model are evaluated on the breast cancer dataset. We first 
trained a random forest model on the original 30 input features representing 569 observations in the dataset. 
Then the model is trained on the optimal features selected by chi-square, sequential feature selection and 
model based feature selection, and its accuracy is compared with that of the model trained on the original 
features before applying feature selection. 
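
The 5-fold cross validation protocol described above can be sketched as follows. This is a minimal, standard-library illustration and not the authors' code: the helper names `k_fold_indices` and `cross_val_accuracy`, the `fit_predict` callback and the random seed are all assumptions introduced here.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Split range(n_samples) into k shuffled, near-equal validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % k) folds receive one extra sample.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds

def cross_val_accuracy(fit_predict, X, y, k=5, seed=0):
    """Mean validation accuracy over k folds.

    fit_predict(X_train, y_train, X_val) is a hypothetical callback that
    trains any classifier and returns predicted labels for X_val.
    """
    scores = []
    for val_idx in k_fold_indices(len(y), k, seed):
        val_set = set(val_idx)
        train_idx = [i for i in range(len(y)) if i not in val_set]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_val = [X[i] for i in val_idx]
        predictions = fit_predict(X_train, y_train, X_val)
        correct = sum(p == y[i] for p, i in zip(predictions, val_idx))
        scores.append(correct / len(val_idx))
    return sum(scores) / k

# With the 569 observations mentioned above, four folds hold 114 samples
# and one holds 113.
folds = k_fold_indices(569, k=5)
fold_sizes = [len(fold) for fold in folds]
```

Each observation is used for validation exactly once, so the averaged accuracy does not depend on a single train/test split.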

3.1. Chi-square (χ²) statistic 

Chi-square is a statistical method for feature selection and a typical example of filter based 
feature selection [16]. The chi-square test compares two variables and examines whether they are related. 
Mathematically, chi-square is defined as (1). 

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} \qquad (1)$$



Where $O_i$ denotes an observed value and $E_i$ denotes the corresponding expected value. The summation 
runs over all pairs of observed and expected values, and the statistic is computed for every input feature 
in the dataset. The chi-square test shows the relationship between two variables in the dataset [17]. A higher 
value of chi-square indicates a stronger dependence between the input feature and the target class or 
variable, so features with higher scores are preferred. 
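
As a worked illustration of (1), the sketch below computes the statistic for a contingency table of a discretized feature against the class label. The function name `chi_square_statistic` and the toy counts are assumptions made for this example and are not taken from the study; the expected counts are derived from the row and column totals under the independence assumption, and all totals are assumed to be non-zero.

```python
def chi_square_statistic(observed):
    """Chi-square statistic of a contingency table (list of rows of counts)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for r, row in enumerate(observed):
        for c, o in enumerate(row):
            # Expected count if the feature and the class were independent.
            e = row_totals[r] * col_totals[c] / grand_total
            statistic += (o - e) ** 2 / e
    return statistic

# Toy counts (hypothetical): rows = feature value low/high,
# columns = benign/malignant.
dependent = chi_square_statistic([[40, 10], [10, 40]])
independent = chi_square_statistic([[25, 25], [25, 25]])
```

The statistic is zero when the observed counts match the expected counts and grows as the feature and the class deviate from independence.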


3.2. Sequential feature selection 
Sequential feature selection (SFS) is a wrapper feature selection method [18], [19]. SFS 
initially selects either the last or the first feature in the dataset. Then one of the 
remaining input features is selected randomly and the model performance is compared. The process is 
repeated for all input features, and the corresponding accuracy for each input feature subset is calculated. 
The input subset that produces the highest accuracy is considered the optimal input feature subset. 
Sequential feature selection is used to reduce an original N-dimensional input feature set to a 
d-dimensional feature subset, where d < N. 
Initialize: Subset = ∅, m = 0 
for i = 1, ..., d: 
    compute the model performance on Subset ∪ {f[i]} 
    m = m + 1 
    Subset = Subset ∪ {f[i]} 
stop when m = d 
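
For comparison, a common greedy forward variant of sequential selection can be sketched as below. Note that this is a swapped-in, textbook formulation (at each step it tries every remaining feature and keeps the best one) rather than the exact procedure of this study; the function name and the `score` callback are hypothetical.

```python
def sequential_forward_selection(n_features, d, score):
    """Greedily pick d of n_features feature indices.

    score(subset) is a hypothetical callback returning the validation
    accuracy of a model trained on the given list of feature indices.
    """
    selected = []
    remaining = list(range(n_features))
    while len(selected) < d and remaining:
        # Try each remaining feature and keep the one that scores best.
        best_feature = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected

# Toy scoring function (assumption): each feature contributes a fixed amount.
toy_weights = [0.1, 0.5, 0.3, 0.05]
chosen = sequential_forward_selection(
    n_features=4, d=2, score=lambda subset: sum(toy_weights[f] for f in subset)
)
```

In practice `score` would train the classifier on the candidate subset and return its cross validation accuracy, which is what makes wrapper methods more expensive than filter methods.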


3.3. Embedded feature selection method 

Tree based models such as decision trees and random forests (ensembles of trees) are used for feature 
selection [20]. Decision tree and random forest models calculate feature importance during training, 
which makes it possible to keep the best features and drop unsuitable features with lower 
importance scores [21], [22]. Random forest is an ensemble model, used here as an embedded feature selection 
method, in which each decision tree in the ensemble is built using observations drawn from 
the complete dataset. 
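
The selection step of an embedded method, keeping the features with the highest importance scores, might look like the sketch below. The helper `select_top_k` is an assumption, and the importance values are hypothetical numbers chosen only for illustration; in the study they would come from the trained random forest.

```python
def select_top_k(importances, k):
    """Return the names of the k features with the highest importance."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical importance scores, for illustration only (not from the study).
toy_importances = {
    "worst concave points": 0.30,
    "worst area": 0.25,
    "mean concave points": 0.20,
    "mean texture": 0.05,
}
top_features = select_top_k(toy_importances, k=3)
```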


3.4. Performance metric 

Accuracy is the most widely employed performance measure for validating the predictive 
performance of a classification model [23]-[25]. Hence, in this study we employed predictive 
accuracy to evaluate the performance of the developed model. Mathematically, classification accuracy is 
defined as the number of correct predictions (true positives TP and true negatives TN) divided by the number 
of samples N in the validation set, as given in (2). 



$$\text{Accuracy} = \frac{TP + TN}{N} \qquad (2)$$

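
A direct translation of (2) into code is sketched below. The function name, the choice of 1 as the positive (malignant) label and the toy label vectors are assumptions made for illustration.

```python
def accuracy(y_true, y_pred, positive=1):
    """Classification accuracy: (TP + TN) / N."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return (tp + tn) / len(y_true)

# Toy labels (hypothetical): 3 true positives and 3 true negatives out of 8.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
toy_accuracy = accuracy(toy_true, toy_pred)
```
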
3.5. Dataset features 

The original breast cancer dataset consists of 33 features. The ranking of the original 33 breast 
cancer diagnostic features is shown in Figure 1. As Figure 1 shows, worst concave points, worst area, mean 
concave points, mean concavity and worst radius have higher importance for breast cancer prediction. Overall, 
Figure 1 shows every feature of the original breast cancer dataset that describes each sample. 

Figure 1. Breast cancer dataset features and their importance 


4. RESULTS AND DISCUSSION 

In this section, the experimental results on the performance of the different feature selection methods 
are presented. Specifically, the optimal features selected by chi-square, sequential and model based (embedded) 
feature selection with the random forest algorithm are presented. The performance of each feature selection 
method is evaluated by the predictive accuracy achieved when only the features selected by that 
method are used for training the random forest model. The experimental results indicate that the 
subset of features selected differs from one feature selection method to another. 


4.1. Comparison on the performance of feature selection method 

The random forest model is trained on the input feature subsets selected by chi-square, sequential and 
feature importance based selection, and five-fold cross validation accuracy is employed to compare the model 
performance on each subset. The comparative result on the performance of sequential, embedded and 
chi-square feature selection is summarized in Table 1. As shown in Table 1, the highest accuracy (98.3%) is 
achieved with the feature subset selected by the sequential feature selection method, compared with 
chi-square (95.78%) and embedded or model based feature importance (96.30%). 


Table 1. The performance of different feature selection methods for breast cancer prediction 

Feature selection method    No. of input features selected    Accuracy 
Chi-square                  8                                 95.78% 
Sequential                  8                                 98.3% 
Embedded                    8                                 96.30% 


Different feature selection methods, such as chi-square (χ²), sequential and embedded feature 
selection, select different sets of input features as optimal. Thus, the performance of a base classifier 
differs across feature selection methods. Overall, all of the feature selection methods yield better 
accuracy on breast cancer detection than the model trained on the original input features. 
Sequential feature selection achieves the best performance, and a ranking method such as 
embedded feature selection performs better than a statistical method such as chi-square. We observe 
from Table 1 that the five-fold cross validation accuracy of the proposed random forest model is highest 
on breast cancer detection with the feature subset selected using the wrapper based sequential feature 
selection method, compared with the chi-square statistical method and the embedded method. 

The performance of sequential, chi-square and embedded feature selection for breast cancer 
prediction is shown in Figure 2. To compare the performance of the feature selection methods, we 
employed predictive accuracy as the performance measure. We observe in Figure 2 that the sequential feature 
selection method outperforms the chi-square and embedded methods. Embedded feature selection 
performs better than the chi-square statistic, which is the least performing feature selection method. 


[Figure 2: bar chart titled "performance of feature selection"; y-axis: accuracy in %; x-axis: feature selection methods (chi-square, sequential, embedded)] 


Figure 2. Performance of feature selection methods for breast cancer prediction 


5. CONCLUSION 

In this study, we have explored the performance of embedded, chi-square and sequential feature 
selection using a breast cancer dataset. The original breast cancer dataset includes 33 features. 
After feature selection, we reduce this number from 33 to 8 and achieve an accuracy of 98.3% using the 
sequential feature selection method. The experimental results show that the accuracy differs across 
feature selection methods: for chi-square and embedded feature selection the accuracy is 95.78% and 96.30%, 
respectively. 